Random Forest

Kristen Monaco, Praya Cheekapara, Raymond Fleming, Teng Ma

Introduction and Lit Review

Random Forest Introduction

  • Ensemble machine learning method based on a large number of decision trees voting to predict a classification
  • Benefits compared to a single decision tree:
    • Able to function with incomplete data
    • Lower likelihood of overfitting
    • Improved prediction accuracy
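As a minimal sketch of the voting-ensemble idea (assuming the `randomForest` package and R's built-in `iris` data, neither of which is part of this project):

```r
library(randomForest)

set.seed(1)
# A forest of 100 trees; each tree votes and the majority class wins
rf <- randomForest(Species ~ ., data = iris, ntree = 100)

# The out-of-bag (OOB) error printed here is a built-in estimate of
# generalization error, one of the benefits over a single tree
print(rf)
```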

Random Forest Applications

  • Regression
    • Price Prediction
    • Demand Forecasting
    • Time-series Analysis

  • Remote Sensing
    • Vegetation Mapping
    • Environmental Monitoring
  • Bioinformatics
    • Disease Diagnosis
    • Protein Classification
    • Gene Expression

  • Banking and Business
    • Fraud Detection
    • Targeted Marketing

Lit Review


Methods

  • Bagging
  • Boosting
  • Random Feature Selection
  • Cross Validation

Random Forest Improvements

  • Bootstrap Sampling (Bagging)
    • Each decision tree uses a random sample of the original dataset
      • Using a subset of the dataset reduces the probability of an overfit model
      • Rows with missing data will often be left out of the sample, improving performance
      • Performed with replacement
  • Boosting
    • In boosting, individual models are trained sequentially, and each model learns from the mistakes of the preceding model.
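The bootstrap step above can be sketched in a few lines of base R (a toy example with an assumed n = 150 rows):

```r
set.seed(1)
n <- 150  # assumed number of rows in the training data

# Bagging: each tree sees n rows drawn WITH replacement
boot_idx <- sample(n, size = n, replace = TRUE)

# Rows never drawn are "out of bag" for this tree
# (on average about 1 - 1/e, i.e. roughly 36.8%, of rows are OOB)
oob_idx <- setdiff(seq_len(n), boot_idx)
length(oob_idx) / n
```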

Random Forest Improvements Continued

  • A random set of features is selected for each node in training
    • Information about feature importance may be saved and applied in future iterations
    • Even with automated random feature selection, feature selection and engineering prior to training may improve performance
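In the `randomForest` package this per-node feature sampling is controlled by the `mtry` parameter (the values below are illustrative, not this project's settings):

```r
library(randomForest)

set.seed(1)
# mtry = number of randomly chosen features considered at each split;
# the default for classification is floor(sqrt(p))
rf_small <- randomForest(Species ~ ., data = iris,
                         ntree = 50, mtry = 2, importance = TRUE)

# Feature importance accumulated during training
importance(rf_small)
```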

Decision Tree and Cross Validation

  • Validation of performance of model
    • Resampling method similar to out-of-bag estimation
    • Allows estimation of the general performance of a model
  • Single decision trees may be visualized to help modelers understand the dataset
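A hand-rolled k-fold cross-validation sketch (again on `iris`, with an assumed k = 5), mirroring the resampling idea described above:

```r
library(randomForest)

set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(iris)))

acc <- numeric(k)
for (i in 1:k) {
  # Train on k-1 folds, evaluate on the held-out fold
  fit <- randomForest(Species ~ ., data = iris[folds != i, ], ntree = 50)
  pred <- predict(fit, iris[folds == i, ])
  acc[i] <- mean(pred == iris$Species[folds == i])
}

mean(acc)  # estimate of general performance
```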

Decision Tree Visualization Code
 m3 <- rpart(
   formula = Group ~ LF + GF + Biomes + Range +
     Habitat_degradation + Habitat_loss + IAS +
     Other + Unknown + Over_exploitation,
   data    = species_train,
   method  = "anova"
 )
 rpart.plot(m3)

Data

  • South African Red List
    • Data about plants with their habitat, traits, distribution, and factors influencing their current threatened/extinct status
  • Purpose
    • Predict whether or not an unknown plant is threatened based on the above characteristics

Distribution of Range by Conservation Status

  • While a small number of threatened species have large ranges, Range appears to be a strong predictor of Group status

  • A smaller range is associated with a higher likelihood of a threatened or extinct grouping.

Distribution Visualization Code
ggplot(data = data, aes(x = Status, y = Range, fill = Status)) +
  geom_boxplot() +
  theme_bw() +
  ylim(0,100000)

Feature Associations

  • Cramer’s V Association with Range binned into 20 categories
    • Target feature Group is most associated with Range, Family, Habitat Loss, Biome, and GF
    • The most associated features will likely be the most important training features
    • Collinearity does not appear to be present

Association Visualization Code
corrDF <- train %>% mutate(Range=ntile(Range, n=20))
corrDF <- corrDF %>% mutate(across(c("Group","LF","GF","Range","Biomes",
                                     "Habitat_degradation","Habitat_loss",
                                     "IAS","Other","Over_exploitation",
                                     "Pollution","Unknown"),as_factor))

corrplot::corrplot(DescTools::PairApply(corrDF,DescTools::CramerV), type='lower')

Analysis and Results

Data Preparation

  • Preprocessing
    • Encode categorical features into numerical / factor features
    • Split the data into a training and test set
    • Minimize class imbalance
Data Split Code
library(caTools)  # provides sample.split()

features <- data[, 1:14]
label <- data[, 15]

split <- sample.split(label, SplitRatio = 0.7)
features_train = features[split,]
features_test = features[!split,]
label_train = label[split]
label_test = label[!split]

data_train <- features_train
data_train$label <- label_train
  Raw       Factor
1 Parasitic 1     
2 Tree      2     
3 Suffrutex 3     
Factorization Table Code
corrDF <- train %>% mutate(Range=ntile(Range, n=20))
corrDF <- corrDF %>% mutate(
             across(
               c("Group","LF","GF","Range",
                 "Biomes",
                 "Habitat_degradation",
                 "Habitat_loss", "IAS","Other",
                 "Over_exploitation","Pollution",
                 "Unknown"),as_factor))

printFactors=matrix(c(train$GF[1],train$GF[2],
                      train$GF[3],corrDF$GF[1],
                      corrDF$GF[2],corrDF$GF[3]),
                  nrow=3)
colnames(printFactors)=c('Raw','Factor')
rownames(printFactors)=c('1','2','3')
print(printFactors,quote=FALSE)

Class Balancing

  • Class Imbalance
    • Resample smaller classes in order to approximate equal classes
    • Training on imbalanced datasets will bias predictions to the larger class
Group Counts Pre-Balancing:   490 148 23 
Group Counts Post-Balancing:  490 490 490
Code
library(ROSE)  # provides ovun.sample()

AB <- data_train
AB <- AB[AB$label != '3',]
AB_res <- ovun.sample(label ~ ., data = AB, 
                      method = "over", N = 980,
                      seed = 1)$data

AC <- data_train
AC <- AC[AC$label != '2',]
AC_res <- ovun.sample(label ~ ., data = AC,
                      method = "over", N = 980,
                      seed = 1)$data

AB_2 <- AB_res[AB_res$label == '2',]
AC_3 <- AC_res[AC_res$label == '3',]

data_train_1 <- AB_res[AB_res$label == '1',]
data_train_combined <- rbind(data_train_1, AB_2, AC_3)

cat("Group Counts Pre-Balancing:  ", 
    table(data_train$label),
    "\nGroup Counts Post-Balancing: ", 
    table(data_train_combined$label))

Model Training

\(X_{new}=\frac{X_{old}-\min(X_{old})}{\max(X_{old})-\min(X_{old})}\)

          Score
Accuracy   0.93
Recall     0.87
Precision  0.89
F1         0.88
  • Min-max normalization rescales each feature to the range [0, 1]
Code
library(randomForest)
library(caret)  # provides confusionMatrix()

features_train_1 <- as.data.frame(lapply(features_train, 
                      function(x) {(x-min(x))/(max(x)-min(x))}))
features_test_1 <- as.data.frame(lapply(features_test, 
                      function(x) {(x-min(x))/(max(x)-min(x))}))

data_train_1 <- features_train_1
data_train_1$label <- label_train
class_counts_1 <- table(data_train_1$label)

model_1 <- randomForest(x = data_train_1[-ncol(data_train_1)],
                        y = as.factor(data_train_1$label), ntree = 2)
variable_importance_1 = importance(model_1)
pred_comb_1 <- predict(model_1, features_test_1)
accuracy <- sum(label_test == pred_comb_1) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_1_factor <- as.factor(pred_comb_1)
cm <- confusionMatrix(pred_comb_1_factor, label_test_factor)
recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"], 
                 cm$byClass["Class: 2", "Sensitivity"], 
                 cm$byClass["Class: 3", "Sensitivity"]))
precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"], 
                    cm$byClass["Class: 2", "Pos Pred Value"], 
                    cm$byClass["Class: 3", "Pos Pred Value"]))
F1 = 2 * recall * precision / ( recall + precision )
printTable=matrix(c(round(accuracy,2),round(recall,2),
                    round(precision,2),round(F1,2)),
                  ncol=1,byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall',
                       'Precision','F1')
print(printTable)

\(X_{new}=\frac{X_{old}-\bar{X}_{old}}{\sigma_{X_{old}}}\)

          Score
Accuracy   0.33
Recall     0.40
Precision  0.31
F1         0.35
Code
features_train_2 <- as.data.frame(lapply(features_train, 
                      function(x) {(x - mean(x))/sd(x)}))
features_test_2 <- as.data.frame(lapply(features_test, 
                      function(x) {(x - mean(x))/sd(x)}))

data_train_2 <- features_train_2
data_train_2$label <- label_train
class_counts_2 <- table(data_train_2$label)

model_2 <- randomForest(x = data_train_2[-ncol(data_train_2)],
                      y = as.factor(data_train_2$label), ntree = 2)
variable_importance_2 = importance(model_2)
pred_comb_2 <- predict(model_2, features_test_2)
accuracy <- sum(label_test == pred_comb_2) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_2_factor <- as.factor(pred_comb_2)
cm <- confusionMatrix(pred_comb_2_factor, label_test_factor)
recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"],
                              cm$byClass["Class: 2", "Sensitivity"],
                              cm$byClass["Class: 3", "Sensitivity"]))

precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"],
                            cm$byClass["Class: 2", "Pos Pred Value"],
                            cm$byClass["Class: 3", "Pos Pred Value"]))
F1 = 2 * recall * precision / ( recall + precision )
printTable=matrix(c(round(accuracy,2),round(recall,2),
                    round(precision,2),round(F1,2)),
                  ncol=1,byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')
print(printTable)

\(X_{new}=\frac{X_{old}}{\max(|X_{old}|)}\)

          Score
Accuracy   0.88
Recall     0.80
Precision  0.79
F1         0.79
Code
features_train_3 <- as.data.frame(lapply(features_train, 
                       function(x) {x / max(abs(x))}))
features_test_3 <- as.data.frame(lapply(features_test, 
                        function(x) {x / max(abs(x))}))

data_train_3 <- features_train_3
data_train_3$label <- label_train
class_counts_3 <- table(data_train_3$label)

model_3 <- randomForest(x = data_train_3[-ncol(data_train_3)], 
                        y = as.factor(data_train_3$label), 
                        ntree = 2)
variable_importance_3 = importance(model_3)
pred_comb_3 <- predict(model_3, features_test_3)
accuracy <- sum(label_test == pred_comb_3) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_3_factor <- as.factor(pred_comb_3)
cm <- confusionMatrix(pred_comb_3_factor, label_test_factor)
recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"],
                 cm$byClass["Class: 2", "Sensitivity"],
                 cm$byClass["Class: 3", "Sensitivity"]))
precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"],
                 cm$byClass["Class: 2", "Pos Pred Value"],
                 cm$byClass["Class: 3", "Pos Pred Value"]))
F1 = 2 * recall * precision / ( recall + precision )
printTable=matrix(c(round(accuracy,2),round(recall,2),
                    round(precision,2),round(F1,2)),
                  ncol=1,byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')
print(printTable)

\(X_{new}=\frac{X_{old}}{\sum{|X_{old}|}}\)

          Score
Accuracy   0.77
Recall     0.60
Precision  0.55
F1         0.57
Code
features_train_4 <- as.data.frame(lapply(features_train, 
                       function(x) {x / sum(abs(x))}))
features_test_4 <- as.data.frame(lapply(features_test, 
                       function(x) {x / sum(abs(x))}))

data_train_4 <- features_train_4
data_train_4$label <- label_train
class_counts_4 <- table(data_train_4$label)

model_4 <- randomForest(x = data_train_4[-ncol(data_train_4)], 
                        y = as.factor(data_train_4$label), ntree = 2)
variable_importance_4 = importance(model_4)
pred_comb_4 <- predict(model_4, features_test_4)
accuracy <- sum(label_test == pred_comb_4) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_4_factor <- as.factor(pred_comb_4)
cm <- confusionMatrix(pred_comb_4_factor, label_test_factor)
recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"],
                              cm$byClass["Class: 2", "Sensitivity"],
                              cm$byClass["Class: 3", "Sensitivity"]))
precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"],
                            cm$byClass["Class: 2", "Pos Pred Value"],
                            cm$byClass["Class: 3", "Pos Pred Value"]))
F1 = 2 * recall * precision / ( recall + precision )
printTable=matrix(c(round(accuracy,2),round(recall,2),
                    round(precision,2),round(F1,2)),
                    ncol=1,byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')
print(printTable)

\(X_{new}=\frac{X_{old}}{\sqrt{\sum{X_{old}^2}}}\)

          Score
Accuracy   0.75
Recall     0.62
Precision  0.53
F1         0.57
Code
features_train_5 <- as.data.frame(lapply(features_train, 
                       function(x) {x / sqrt(sum(x^2))}))
features_test_5 <- as.data.frame(lapply(features_test, 
                       function(x) {x / sqrt(sum(x^2))}))

data_train_5 <- features_train_5
data_train_5$label <- label_train
class_counts_5 <- table(data_train_5$label)

model_5 <- randomForest(x = data_train_5[-ncol(data_train_5)], 
                        y = as.factor(data_train_5$label), ntree = 2)
variable_importance_5 = importance(model_5)
pred_comb_5 <- predict(model_5, features_test_5)
accuracy <- sum(label_test == pred_comb_5) / length(label_test)
label_test_factor <- as.factor(label_test)
pred_comb_5_factor <- as.factor(pred_comb_5)
cm <- confusionMatrix(pred_comb_5_factor, label_test_factor)

recall <- mean(c(cm$byClass["Class: 1", "Sensitivity"],
                 cm$byClass["Class: 2", "Sensitivity"],
                 cm$byClass["Class: 3", "Sensitivity"]))

precision <- mean(c(cm$byClass["Class: 1", "Pos Pred Value"],
                 cm$byClass["Class: 2", "Pos Pred Value"],
                 cm$byClass["Class: 3", "Pos Pred Value"]))

F1 = 2 * recall * precision / ( recall + precision )

printTable=matrix(c(round(accuracy,2),round(recall,2),
                    round(precision,2),round(F1,2)),ncol=1,
                    byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')

print(printTable)

Prediction Method

  • Each independently trained decision tree produces its own prediction
    • Trained on different subsets of both data and features
  • The results from each decision tree are combined into a voting classifier using the mode of the individual predictions
    • \(y_{final}=\text{mode}(\{y_1,y_2,y_3,y_4,y_5\})\)
  • Iterate over entire test set, storing results
  • Generate a confusion matrix and calculate the sensitivity and precision for each category

Prediction

          Score
Accuracy   0.90
Recall     0.82
Precision  0.76
F1         0.79
  • Prediction using a combination of trees improves performance on all metrics
Code
n <- length(pred_comb_1)
final_pred <- rep(NA, n)
for(i in 1:n) {
   preds <- c(pred_comb_1[i], pred_comb_2[i], pred_comb_3[i], 
              pred_comb_4[i], pred_comb_5[i])
   final_pred[i] <- as.numeric(names(which.max(table(preds))))
}

accuracy <- sum(label_test == final_pred) / length(label_test)

final_pred_factor <- as.factor(final_pred)
label_test_factor <- as.factor(label_test)
cm_vote <- confusionMatrix(final_pred_factor, label_test_factor)

sensitivity_class1 <- cm_vote$byClass["Class: 1", "Sensitivity"]
sensitivity_class2 <- cm_vote$byClass["Class: 2", "Sensitivity"]
sensitivity_class3 <- cm_vote$byClass["Class: 3", "Sensitivity"]
recall = (sensitivity_class1 + sensitivity_class2 + sensitivity_class3)/3

precision_class1 <- cm_vote$byClass["Class: 1", "Pos Pred Value"]
precision_class2 <- cm_vote$byClass["Class: 2", "Pos Pred Value"]
precision_class3 <- cm_vote$byClass["Class: 3", "Pos Pred Value"]
precision = (precision_class1 + precision_class2 + precision_class3)/3

F1 = 2 * recall * precision / ( recall + precision )

printTable=matrix(c(round(accuracy,2),round(recall,2),
                    round(precision,2),round(F1,2)),ncol=1,
                    byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Recall','Precision','F1')
print(printTable)

Feature Importance

  • Range is the most important feature to predict Group Status
    • Habitat Loss and Growth Form are also important features
    • Life Form, Other, and Unknown are unimportant for prediction
ctrl <- trainControl(method = "cv",  
                     number = 10) 

bagged_cv <- train(
  Group ~ LF + GF + Biomes + Range +
    Habitat_degradation + 
    Habitat_loss + IAS + Other +
    Unknown + Over_exploitation,
  data    = species_train,
  method = "treebag",
  trControl = ctrl,
  importance = TRUE
)
 
plot(varImp(bagged_cv), 10) 

Evaluation

\(\text{Accuracy}=\frac{\sum{\left(\text{Actual Label} = \text{Predicted Label}\right)}}{\text{Label Count}}\)

\(\text{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}\)

\(\text{Precision}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}}\)

\(\text{F1}=\frac{2*(\text{Precision}*\text{Recall})}{\text{Precision}+\text{Recall}}\)
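A tiny worked check of these formulas, using made-up binary counts (not numbers from this analysis):

```r
# Hypothetical counts: TP = 8, FP = 2, FN = 2, TN = 88
TP <- 8; FP <- 2; FN <- 2; TN <- 88

accuracy  <- (TP + TN) / (TP + FP + FN + TN)           # 0.96
recall    <- TP / (TP + FN)                            # 0.8
precision <- TP / (TP + FP)                            # 0.8
F1 <- 2 * (precision * recall) / (precision + recall)  # 0.8

c(Accuracy = accuracy, Recall = recall, Precision = precision, F1 = F1)
```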

Confusion Matrix and Statistics

                 Score      
Accuracy         0.9        
Accuracy p-value <.001      
95% CI           (0.86,0.93)
Kappa            0.75       
                       LC  Thr  Ext
Sensitivity          0.97 0.68 0.80
Specificity          0.84 0.97 0.97
Pos Pred Value       0.94 0.86 0.47
Neg Pred Value       0.91 0.91 0.99
Precision            0.94 0.86 0.47
Recall               0.97 0.68 0.80
F1                   0.96 0.76 0.59
Prevalence           0.74 0.22 0.04
Detection Rate       0.72 0.15 0.03
Detection Prevalence 0.76 0.18 0.06
Balanced Accuracy    0.90 0.83 0.88

Confusion Matrix Code
cm <- confusionMatrix(final_pred_factor, label_test_factor)

cm_d <- as.data.frame(cm$table)
cm_st <-data.frame(cm$overall)
cm_st$cm.overall <- round(cm_st$cm.overall,2)
cm_d$diag <- cm_d$Prediction == cm_d$Reference
cm_d$ndiag <- cm_d$Prediction != cm_d$Reference     
cm_d[cm_d == 0] <- NA
# reverse factor levels so the matrix plots in conventional orientation
cm_d$Reference <- factor(cm_d$Reference, levels = rev(levels(cm_d$Reference)))
cm_d$ref_freq <- cm_d$Freq * ifelse(is.na(cm_d$diag),-1,1) 
 
plt1 <-  ggplot(data = cm_d, aes(x = Prediction , y =  Reference, 
                                 fill = Freq))+
  scale_x_discrete(position = "top") +
  geom_tile( data = cm_d,aes(fill = ref_freq)) +
  scale_fill_gradient2(guide = FALSE ,low="red",high="mediumvioletred", 
                       mid= "mistyrose",
                    midpoint = 0,na.value = 'white') +
  geom_text(aes(label = Freq), color = 'black', size = 3)+
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        legend.position = "none",
        panel.border = element_blank(),
        plot.background = element_blank(),
        axis.line = element_blank(),
  )
plt2 <-  tableGrob(cm_st)
grid.arrange(plt1, plt2, nrow = 1, ncol = 2,
            top=textGrob("Confusion Matrix",gp=gpar(fontsize=25,font=1)))
Table Print Code
printTable=matrix(c(round(cm$overall['Accuracy'],2),
                    if(cm$overall['AccuracyPValue']<0.001){"<.001"}
                    else round(cm$overall['AccuracyPValue'],3),
                    paste("(",round(cm$overall['AccuracyLower'],2),
                          ",",round(cm$overall['AccuracyUpper'],2),
                          ")",sep=""),
                    round(cm$overall['Kappa'],2)),ncol=1,
                    byrow=TRUE)
colnames(printTable)=c('Score')
rownames(printTable)=c('Accuracy','Accuracy p-value','95% CI','Kappa')
print(printTable,quote=FALSE)

cmBC<-cm$byClass
rownames(cmBC)<-c("LC","Thr","Ext")
print(t(round(cmBC,2)),quote=FALSE)

Conclusion

  • Predictive model with 90% accuracy.
  • Range was the strongest predictor of extinction.
    • Cape Floristic region (Fynbos biome)
    • Perennial shrubs
  • Habitat loss was the 2nd most important variable.
  • Further studies could include increased surveillance of threats, species monitoring, and ecological value of plant biodiversity.